Abstract

In this paper we consider the problem of computing the longest common abelian
factor (LCAF) between two given strings. We present a simple $O(\sigma n^2)$ time
algorithm, where $n$ is the length of the strings and $\sigma$ is the alphabet size,
and a sub-quadratic running time solution for the binary string case, both having
linear space requirement. Furthermore, we present a modified algorithm applying
some interesting tricks and experimentally show that the resulting algorithm
runs faster.


1. Introduction 

Abelian properties concerning words have been investigated since the very
beginning of the study of Formal Languages and Combinatorics on Words.
Abelian powers were first considered in 1961 by Erdős [Erd61] as a natural
generalization of usual powers. In 1966, Parikh [Par66] defined a vector having
length equal to the alphabet cardinality, which reports the number of
occurrences of each alphabet symbol inside a given string. Later on, the scientific
community started referring to such a vector as the Parikh vector. Clearly, two
strings having the same Parikh vector are permutations of one another and
there is an abelian match between them.

Abelian properties of strings have recently attracted tremendous interest among
Stringology researchers and have become an engaging topic of discussion in
recent editions of the StringMasters meetings. Despite the fact that there are
not so many real life applications where comparing commutative sequences of
objects is relevant, abelian combinatorics has a potential role in filtering
data in order to find potential occurrences of approximate matches. For
instance, when one is looking for typing errors in a natural language text, it
can be useful to select the abelian matches first and then look for swaps of
adjacent or nearby letters. Swap errors and inversion errors are
also very common in the evolutionary process of the genetic code of a living
organism and hence are often interesting from a Bioinformatics perspective. Similar
applications can also be found in the context of network communications.

In this paper, we focus on the problem of finding the Longest Common
Abelian Factor of two given strings. The problem is combinatorially interesting
and analogous to the Longest Common Substring (LCStr) problem for the usual
strings. The LCStr problem is a historical problem, and Dan Gusfield reported
the following in his book [Gus97, Sec. 7.4] regarding the belief of Don Knuth
about the complexity of the problem:

...in 1970 Don Knuth conjectured a linear time algorithm for this 
problem would be impossible. 

However, contrary to the above conjecture, decades later, a linear time solution
for the LCStr problem was in fact obtained by using the linear time construction
of the suffix tree. For Stringology researchers this alone could be the motivation
for considering LCAF from both an algorithmic and a combinatorial point of view.
However, despite a number of works on abelian matching, to the best of our
knowledge, this problem had never been considered until very recently, when it
was posed at the latest edition of StringMasters, i.e., StringMasters 2013. To
this end, this research work can be seen as a first attempt to solve this problem
with the hope of many more to follow.

In this paper, we first present a simple solution to the problem running in
$O(\sigma n^2)$ time, where $\sigma$ is the alphabet size (Section 3). Then we present a
sub-quadratic algorithm for the binary string case (Section 4). Both the algorithms
have linear space requirement. Furthermore, we present a modified algorithm
applying some interesting tricks (Section 5) and experimentally show that the
resulting algorithm runs in $O(n \log n)$ time (Section 6).

2. Preliminaries 

An alphabet $\Sigma$ of size $\sigma > 0$ is a finite set whose elements are called letters.
A string on an alphabet $\Sigma$ is a finite, possibly empty, sequence of elements of $\Sigma$.
The zero-letter sequence is called the empty string, and is denoted by $\varepsilon$. The
length of a string $S$ is defined as the length of the sequence associated with the
string $S$, and is denoted by $|S|$. We denote by $S[i]$ the $i$-th letter of $S$, for all
$1 \le i \le |S|$, and $S = S[1..|S|]$. A string $w$ is a factor of a string $S$ if there exist
two strings $u$ and $v$, possibly empty, such that $S = uwv$. A factor $w$ of a string
$S$ is proper if $w \ne S$. If $u = \varepsilon$ ($v = \varepsilon$), then $w$ is a prefix (suffix) of $S$.

Given a string $S$ over the alphabet $\Sigma = \{a_1, \ldots, a_\sigma\}$, we denote by $|S|_{a_j}$
the number of $a_j$'s in $S$, for $1 \le j \le \sigma$. We define the Parikh vector of $S$ as
$\mathcal{P}_S = (|S|_{a_1}, \ldots, |S|_{a_\sigma})$.
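
As a small illustration of the definition above, the following Python sketch computes a Parikh vector and tests for an abelian match. The function names `parikh_vector` and `abelian_match` and the convention of passing the alphabet as an ordered string are assumptions made for this example only.

```python
from collections import Counter

def parikh_vector(s, alphabet):
    """Return the Parikh vector of s as a tuple ordered like `alphabet`."""
    counts = Counter(s)  # missing letters count as 0
    return tuple(counts[a] for a in alphabet)

def abelian_match(u, v, alphabet):
    """True when u and v are permutations of one another."""
    return len(u) == len(v) and parikh_vector(u, alphabet) == parikh_vector(v, alphabet)

print(parikh_vector("aacgcctaatcg", "acgt"))  # (4, 4, 2, 2)
```

The string used in the call is the one that appears in the example of Section 5.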

In the binary case, we denote $\Sigma = \{0, 1\}$, the number of 0's in $S$ by $|S|_0$, the
number of 1's in $S$ by $|S|_1$ and the Parikh vector of $S$ as $\mathcal{P}_S = (|S|_0, |S|_1)$. We
now focus on binary strings. The general alphabet case will be considered later.

For a given binary string $S$ of length $n$, we define an $n \times n$ matrix $M_S$ as
follows. Each row of $M_S$ is dedicated to a particular length of factors of $S$.
So, Row $\ell$ of $M_S$ is dedicated to $\ell$-length factors of $S$. Each column of $M_S$
is dedicated to a particular starting position of factors of $S$. So, Column $i$ of
$M_S$ is dedicated to the position $i$ of $S$. Hence, $M_S[\ell][i]$ is dedicated to the
$\ell$-length factor that starts at position $i$ of $S$ and it reports the number of 1's of
that factor. Now, $M_S[\ell][i] = m$ if and only if the $\ell$-length factor that starts at
position $i$ of $S$ has a total of $m$ 1's, that is, $|S[i..i+\ell-1]|_1 = m$. We formally
define the matrix $M_S$ as follows.

Definition 1. Given a binary string $S$ of length $n$, $M_S$ is an $n \times n$ matrix such
that $M_S[\ell][i] = |S[i..i+\ell-1]|_1$, for $1 \le \ell \le n$ and $1 \le i \le n - \ell + 1$, and
$M_S[\ell][i] = 0$, otherwise.

In what follows, we will use $M_S[\ell]$ to refer to Row $\ell$ of $M_S$. Assume that
we are given two strings $A$ and $B$ on an alphabet $\Sigma$. For the sake of ease, we
assume that $|A| = |B| = n$. We want to find the length of a longest common
abelian factor between $A$ and $B$.

Definition 2. Given two strings $A$ and $B$ over the alphabet $\Sigma$, we say that $w$
is a common abelian factor for $A$ and $B$ if there exist a factor (or substring) $u$ in
$A$ and a factor $v$ in $B$ such that $\mathcal{P}_w = \mathcal{P}_u = \mathcal{P}_v$. A common abelian factor of the
highest length is called the Longest Common Abelian Factor (LCAF) between
$A$ and $B$. The length of LCAF is referred to as the LCAF length.

In this paper we study the following problem. 

Problem 1 (LCAF Problem). Given two strings $A$ and $B$ over the alphabet
$\Sigma$, compute the length of an LCAF and identify some occurrences of an LCAF
between $A$ and $B$.

Assume that the strings $A$ and $B$ of length $n$ are given. Now, suppose that
the matrices $M_A$ and $M_B$ for the binary strings $A$ and $B$ have been computed.
Now we have the following easy lemma that will be useful for us later.

Lemma 2. There is a common abelian factor of length $\ell$ between $A$ and $B$ if
and only if there exist $p, q$ such that $1 \le p, q \le n - \ell + 1$ and $M_A[\ell][p] = M_B[\ell][q]$.

Proof. Suppose there exist $p, q$ such that $1 \le p, q \le n - \ell + 1$ and $M_A[\ell][p] =
M_B[\ell][q]$. By definition this means $|A[p..p+\ell-1]|_1 = |B[q..q+\ell-1]|_1$. So
there is a common abelian factor of length $\ell$ between $A$ and $B$. The other way
is also obvious by definition.

Clearly, if we have $M_A$ and $M_B$ we can compute the LCAF by identifying the
highest $\ell$ such that there exist $p, q$ having $1 \le p, q \le n - \ell + 1$ and $M_A[\ell][p] =
M_B[\ell][q]$. Then we can say that the LCAF between $A$ and $B$ is either $A[p..p+\ell-1]$
or $B[q..q+\ell-1]$, having length $\ell$.

We now generalize the definition of the matrix $M_S$ for strings over a fixed
size alphabet $\Sigma = \{a_1, \ldots, a_\sigma\}$ by defining an $n \times n$ matrix $M_S$ of
$(\sigma - 1)$-length vectors: $M_S[\ell][i] = V_{\ell,i}$, where $V_{\ell,i}[j] = |S[i..i+\ell-1]|_{a_j}$,
for $1 \le \ell \le n$, $1 \le i \le n - \ell + 1$ and $1 \le j \le \sigma - 1$, and $V_{\ell,i}[j] = 0$,
otherwise. We will refer to the $j$-th element of the array $V_{\ell,i}$ of the matrix
$M_S$ by using the notation $M_S[\ell][i][j]$. Notice that the last component of a
Parikh vector is determined by the length of the string and all the other
components of the Parikh vector. Now, $M_S[\ell][i][j] = m$ if and only if the
$\ell$-length factor that starts at position $i$ of $S$ has a total of $m$ $a_j$'s, that is,
$|S[i..i+\ell-1]|_{a_j} = m$. Clearly, we can compute $M_S[\ell]$ by sliding a window of
length $\ell$ over $S$. Similar to the binary case, this computation runs in linear
time because we can compute $|S[i+1..i+1+\ell-1]|_{a_j}$ from $|S[i..i+\ell-1]|_{a_j}$
in constant time by simply decrementing the $S[i]$ component and incrementing
the $S[i+\ell]$ one.
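
The sliding-window update described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `parikh_row` is hypothetical, indices are 0-based, and for simplicity every vector keeps all $\sigma$ components rather than the $\sigma - 1$ used in the text.

```python
def parikh_row(s, length, alphabet):
    """Parikh vectors of all factors of `s` with the given length, left to right."""
    index = {a: j for j, a in enumerate(alphabet)}
    n = len(s)
    if length < 1 or length > n:
        return []
    window = [0] * len(alphabet)
    for ch in s[:length]:                      # vector of the first window
        window[index[ch]] += 1
    row = [tuple(window)]
    for i in range(1, n - length + 1):
        window[index[s[i - 1]]] -= 1           # letter leaving the window
        window[index[s[i + length - 1]]] += 1  # letter entering the window
        row.append(tuple(window))              # O(sigma) copy per position
    return row

print(parikh_row("aab", 2, "ab"))  # [(2, 0), (1, 1)]
```

The update itself touches two counters per position; storing a copy of each vector is what brings the cost of a whole row to $O(\sigma n)$.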

3. A Quadratic Algorithm 

A simple approach for finding the LCAF length considers computing, for
$1 \le \ell \le n$, the Parikh vectors of all the factors of length $\ell$ in both $A$ and $B$,
i.e., $M_A[\ell]$ and $M_B[\ell]$. Then, we check whether $M_A[\ell]$ and $M_B[\ell]$ have a
non-empty intersection. If yes, then $\ell$ could be the LCAF length. So, we return the
highest such $\ell$. Moreover, if one knows a Parikh vector of the LCAF
length belonging to such an intersection, a linear scan of $A$ and $B$ produces one
occurrence of such a factor. The asymptotic time complexity of this approach
is $O(\sigma n^2)$ and it requires $O(\sigma n \log n)$ bits of extra space. The basic steps are
outlined as follows.

1. For $\ell = 1$ to $n$ do the following
2.     For $i = 1$ to $n - \ell + 1$ do the following
3.         compute $M_A[\ell][i]$ and $M_B[\ell][i]$
4.     If $M_A[\ell] \cap M_B[\ell] \ne \emptyset$ then
5.         LCAF $= \ell$

It is well known that, for a fixed length $\ell$, one can compute all the Parikh
vectors in linear time and store them in $O(\sigma n \log n)$ bits. Now once $M_A$ and $M_B$
are computed, we simply need to apply the idea of Lemma 2. The idea is to check
for all values of $\ell$ whether there exists a pair $p, q$ such that $1 \le p, q \le n - \ell + 1$
and $M_A[\ell][p] = M_B[\ell][q]$. Then return the highest value of $\ell$ and the corresponding
values of $p, q$.
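
The quadratic scheme above can be sketched in Python as follows. Two choices here are assumptions of this sketch rather than part of the text: the rows are intersected with a hash table instead of by sorting, and the lengths are scanned from $n$ downwards so that the first hit is already the highest $\ell$. The names `parikh_row` and `lcaf_quadratic` are hypothetical and positions are 0-based.

```python
def parikh_row(s, length, alphabet):
    """Parikh vectors of all factors of `s` with the given length (sliding window)."""
    index = {a: j for j, a in enumerate(alphabet)}
    window = [0] * len(alphabet)
    for ch in s[:length]:
        window[index[ch]] += 1
    row = [tuple(window)]
    for i in range(1, len(s) - length + 1):
        window[index[s[i - 1]]] -= 1
        window[index[s[i + length - 1]]] += 1
        row.append(tuple(window))
    return row

def lcaf_quadratic(a, b, alphabet):
    """Return (length, p, q): LCAF length and 0-based start positions in a and b."""
    n = len(a)
    assert len(b) == n, "the text assumes |A| = |B| = n"
    for length in range(n, 0, -1):
        first_in_a = {}
        for p, vec in enumerate(parikh_row(a, length, alphabet)):
            first_in_a.setdefault(vec, p)      # remember the leftmost occurrence
        for q, vec in enumerate(parikh_row(b, length, alphabet)):
            if vec in first_in_a:              # Lemma 2: equal vectors, common factor
                return length, first_in_a[vec], q
    return 0, 0, 0                             # no letter in common

print(lcaf_quadratic("aab", "bba", "ab"))  # (2, 1, 1)
```

With hashing the intersection test takes expected $O(\sigma n)$ time per row, which matches the $O(\sigma n^2)$ total stated in the text only in expectation.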

In the binary case, a Parikh vector is fully represented by just one arbitrarily
chosen component. Hence, the set of Parikh vectors of binary factors is just
a one-dimensional list of integers that can be stored in $O(n \log n)$ bits, since we
have $n$ values in the range $[0..n]$. The intersection can be accomplished in two
steps. First, we sort the $M_A[\ell]$ and $M_B[\ell]$ rows in $O(n)$ time by putting them in
two lists and using the classic Counting Sort algorithm [CLRS01, Section 8.2].
Then, we check for a non-empty intersection with a simple linear scan of the
two lists in linear time by starting in parallel from the beginning of the two lists
and moving forward element by element on the list having the smallest value
among the two examined elements. A further linear scan of $M_A[\ell]$ and $M_B[\ell]$
will find the indexes $p, q$ of an element of the non-empty intersection. This gives
us an $O(n^2)$ time algorithm requiring $O(n \log n)$ bits of space for computing an
LCAF of two given binary strings.
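
A minimal Python sketch of the binary procedure just described follows: counting sort of the two rows, then a parallel scan of the sorted lists, then a further scan to recover positions. All function names are hypothetical, strings are assumed to be over the characters '0' and '1' with equal lengths, and positions are 0-based.

```python
def ones_row(s, length):
    """Number of '1's in every factor of the binary string `s` of the given length."""
    count = s[:length].count("1")
    row = [count]
    for i in range(1, len(s) - length + 1):
        count += (s[i + length - 1] == "1") - (s[i - 1] == "1")
        row.append(count)
    return row

def counting_sort(values, max_value):
    """Sort integers from the range [0..max_value] in linear time."""
    buckets = [0] * (max_value + 1)
    for v in values:
        buckets[v] += 1
    out = []
    for v, c in enumerate(buckets):
        out.extend([v] * c)
    return out

def common_value(sorted_a, sorted_b):
    """Parallel scan: advance on the list whose current element is smaller."""
    i = j = 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            return sorted_a[i]
        if sorted_a[i] < sorted_b[j]:
            i += 1
        else:
            j += 1
    return None

def lcaf_binary_quadratic(a, b):
    """Return (length, p, q) for two binary strings of equal length."""
    for length in range(len(a), 0, -1):
        row_a, row_b = ones_row(a, length), ones_row(b, length)
        k = common_value(counting_sort(row_a, length), counting_sort(row_b, length))
        if k is not None:
            return length, row_a.index(k), row_b.index(k)  # further linear scan
    return 0, 0, 0

print(lcaf_binary_quadratic("0001", "1110"))  # (2, 2, 2)
```

Here the common factor has length 2 and contains a single 1: "01" in the first string and "10" in the second.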

In the more general case of an alphabet of size greater than two, comparing two
Parikh vectors is no longer a constant time operation and checking for empty
intersections is not a trivial task. In fact, sorting the set of vectors requires a
total order to be defined. We can define an order component by component, giving
more weight to the first component, then to the second one and so on. More formally,
we define $x < y$, with $x, y \in \mathbb{N}^\sigma$, if there exists $1 \le k \le \sigma$ such that $x[k] < y[k]$
and, for any $i$ with $1 \le i < k$, $x[i] = y[i]$. Notice that comparing two vectors
will take $O(\sigma)$ time.

Now, one can sort two lists of $n$ vectors of dimension $\sigma - 1$, i.e., $M_A[\ell]$ and
$M_B[\ell]$, in $O(\sigma n)$ by using $n$ comparisons taking $O(\sigma)$ each. Therefore, now the
algorithm runs in $O(\sigma n^2)$ time using $O(\sigma n \log n)$ bits of extra space.

4. A Sub-quadratic Algorithm for the Binary Case 

In Section 3 we have presented an $O(n^2)$ algorithm to compute the LCAF
between two binary strings and two occurrences of common abelian factors, one
in each string, having LCAF length. In this section, we show how we can achieve
a better running time for the LCAF problem. We will make use of the recent
data structure of Moosa and Rahman [MR10] for indexing an abelian pattern.
The results of Moosa and Rahman [MR10] are presented in the form of the following
lemmas, with appropriate rephrasing to facilitate our description.

Lemma 3. (Interpolation lemma). If $S_1$ and $S_2$ are two substrings of a string
$S$ on a binary alphabet such that $\ell = |S_1| = |S_2|$, $|S_1|_1 = i$, $|S_2|_1 = j$, $j > i + 1$,
then there exists another substring $S_3$ such that $\ell = |S_3|$ and $i < |S_3|_1 < j$.

Lemma 4. Suppose we are given a string $S$ of length $n$ on a binary alphabet.
Suppose that $\mathit{maxOne}(S, \ell)$ and $\mathit{minOne}(S, \ell)$ denote, respectively, the
maximum and minimum number of 1's in any substring of $S$ having length
$\ell$. Then, for all $1 \le \ell \le n$, $\mathit{maxOne}(S, \ell)$ and $\mathit{minOne}(S, \ell)$ can be computed
in $O(n^2 / \log n)$ time and linear space.

A result similar to Lemma 3 is contained in the paper of Cicalese et al.
[CFL09, Lemma 4], while the result of Lemma 4 has been discovered
simultaneously and independently by Moosa and Rahman [MR10] and by Burcsi et al.
[BCFL10]. In addition to the above results we further use the following lemma.

Lemma 5. Suppose we are given two binary strings $A$, $B$ of length $n$ each.
There is a common abelian factor of $A$ and $B$ having length $\ell$ if and only if
$\mathit{maxOne}(B, \ell) \ge \mathit{minOne}(A, \ell)$ and $\mathit{maxOne}(A, \ell) \ge \mathit{minOne}(B, \ell)$.

Proof. Assume that $min_A = \mathit{minOne}(A, \ell)$, $max_A = \mathit{maxOne}(A, \ell)$, $min_B =
\mathit{minOne}(B, \ell)$, $max_B = \mathit{maxOne}(B, \ell)$. Now by Lemma 3, for all $min_A \le k_A \le
max_A$, we have some $\ell$-length substrings $A(k_A)$ of $A$ such that $|A(k_A)|_1 = k_A$.
Similarly, for all $min_B \le k_B \le max_B$, we have some $\ell$-length factors $B(k_B)$
of $B$ such that $|B(k_B)|_1 = k_B$. Now, consider the ranges $[min_A..max_A]$ and
$[min_B..max_B]$. Clearly, these two ranges overlap if and only if $max_B \ge min_A$
and $max_A \ge min_B$. If these two ranges overlap then there exists some $k$ such
that $min_A \le k \le max_A$ and $min_B \le k \le max_B$. Then we must have some
$\ell$-length factors $A(k)$ and $B(k)$. Hence the result follows.

Let us now focus on devising an algorithm for computing the LCAF given
two binary strings $A$ and $B$ of length $n$. For all $1 \le \ell \le n$, we compute
$\mathit{maxOne}(A, \ell)$, $\mathit{minOne}(A, \ell)$, $\mathit{maxOne}(B, \ell)$ and $\mathit{minOne}(B, \ell)$ in $O(n^2 / \log n)$
time (Lemma 4). Now we check the necessary and sufficient condition of
Lemma 5 for all $1 \le \ell \le n$, starting from $n$ down to 1. We compute the highest
$\ell$ such that

$[\mathit{minOne}(A, \ell)..\mathit{maxOne}(A, \ell)]$ and $[\mathit{minOne}(B, \ell)..\mathit{maxOne}(B, \ell)]$ overlap.

Suppose that $\mathcal{K}$ is the set of values contained in the above overlap, that
is, $\mathcal{K} = \{ k \mid k \in [\mathit{minOne}(A, \ell)..\mathit{maxOne}(A, \ell)]$ and $k \in [\mathit{minOne}(B, \ell)..
\mathit{maxOne}(B, \ell)] \}$. Then by Lemma 5 we must have a set $\mathcal{S}$ of common abelian
factors of $A$, $B$ such that for all $S \in \mathcal{S}$, $|S| = \ell$. Since we identify the highest
$\ell$, the length of a longest common abelian factor must be $\ell$, i.e., the LCAF length
is $\ell$. Additionally, we have further identified the number of 1's in such longest
factors in the form of the set $\mathcal{K}$. Also, note that for a $k \in \mathcal{K}$, we must have a factor
$S \in \mathcal{S}$ such that $|S|_1 = k$.

Now let us focus on identifying an occurrence of the LCAF. There are a
number of ways to do that. But a straightforward and conceptually easy way is
to run the folklore $\ell$-window based algorithm in [MR10] on the strings $A$ and $B$
to find the $\ell$-length factor with number of 1's equal to a particular value $k \in \mathcal{K}$.

The overall running time of the algorithm is deduced as follows. By Lemma 4,
the computation of $\mathit{maxOne}(A, \ell)$, $\mathit{minOne}(A, \ell)$, $\mathit{maxOne}(B, \ell)$ and $\mathit{minOne}(B, \ell)$
can be done in $O(n^2 / \log n)$ time and linear space. The checking of the condition
of Lemma 5 can be done in constant time for a particular value of $\ell$. Therefore,
in total, it can be done in $O(n)$ time. Finally, the folklore algorithm requires
$\Theta(n)$ time to identify an occurrence (or all of them) of the factors. In total the
running time is $O(n^2 / \log n)$ and the space is linear.
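
The overall procedure of this section can be sketched in Python as below. One important caveat: the helper `min_max_ones` computes the minima and maxima with naive sliding windows in $O(n^2)$ time, as a stand-in for the $O(n^2 / \log n)$ method of Lemma 4, which is not reproduced here. All names are hypothetical and positions are 0-based.

```python
def min_max_ones(s):
    """min/max number of '1's over all factors of each length 1..n.
    Naive O(n^2) stand-in for Lemma 4 (NOT the O(n^2 / log n) method)."""
    n = len(s)
    min_one, max_one = [0] * (n + 1), [0] * (n + 1)
    for length in range(1, n + 1):
        count = s[:length].count("1")
        lo = hi = count
        for i in range(1, n - length + 1):
            count += (s[i + length - 1] == "1") - (s[i - 1] == "1")
            lo, hi = min(lo, count), max(hi, count)
        min_one[length], max_one[length] = lo, hi
    return min_one, max_one

def find_factor(s, length, k):
    """Window scan: 0-based start of a factor of the given length with k ones."""
    count = s[:length].count("1")
    if count == k:
        return 0
    for i in range(1, len(s) - length + 1):
        count += (s[i + length - 1] == "1") - (s[i - 1] == "1")
        if count == k:
            return i
    return None

def lcaf_binary(a, b):
    """Return (length, k, p, q) using the overlap condition of Lemma 5."""
    min_a, max_a = min_max_ones(a)
    min_b, max_b = min_max_ones(b)
    for length in range(len(a), 0, -1):
        if max_b[length] >= min_a[length] and max_a[length] >= min_b[length]:
            k = max(min_a[length], min_b[length])  # smallest value in the overlap
            return length, k, find_factor(a, length, k), find_factor(b, length, k)
    return 0, 0, None, None

print(lcaf_binary("0001", "1110"))  # (2, 1, 2, 2)
```

Any value in the overlap would serve as `k`; taking the larger of the two minima is just a convenient choice, and the interpolation lemma guarantees that `find_factor` succeeds for it in both strings.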

5. Towards a Better Time Complexity 

In this section we discuss a simple variant of the quadratic algorithm
presented in Section 3. We recall that the main idea of the quadratic solution is to
find the greatest $\ell$ with $M_A[\ell] \cap M_B[\ell] \ne \emptyset$. The variant we present here is based
on the following two simple observations:

1. One can start considering sets of factors of decreasing lengths;
2. When an empty intersection is found between $M_A[\ell]$ and $M_B[\ell]$, some
   rows can possibly be skipped based on the evaluation of the gap between
   $M_A[\ell]$ and $M_B[\ell]$.

The first observation is trivial. The second observation is what we call the
skip trick. Assume that $M_A[\ell]$ and $M_B[\ell]$ have been computed and
$M_A[\ell] \cap M_B[\ell] = \emptyset$ has been found. It is easy to see that, for any starting
position $i$ and for any component $j$ (i.e., a letter $a_j$), we have

$M_A[\ell][i][j] - 1 \le M_A[\ell - 1][i][j] \le M_A[\ell][i][j] + 1$

Exploiting this property, we keep track, along the computation of $M_A[\ell]$ and
$M_B[\ell]$, of the minimum and maximum values that appear in Parikh vectors of
factors of length $\ell$. We use four arrays of length $\sigma$, namely $min_A$, $max_A$,
$min_B$, $max_B$. Notice that such arrays do not represent Parikh vectors as they
just contain min and max values component by component. Formally, $min_A[j] =
\min\{M_A[\ell][i][j]\}$, for any $i = 1, \ldots, n - \ell + 1$. The others have similar definitions.

We compare, component by component, the range of $a_j$ in $A$ and $B$ and
we skip as many rows as $\max_{j=1}^{\sigma} (min_B[j] - max_A[j])$, assuming $min_B[j] >
max_A[j]$ (swap $A$ and $B$, otherwise). The modified algorithm is reported in
Algorithm 1.
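
One possible reading of the skip rule is sketched below in Python. Two details are assumptions of this sketch and are not stated in the text: a component whose two ranges overlap contributes a gap of 0, and the returned value is never smaller than 1, so that the length always decreases. The names `component_ranges` and `skip` are hypothetical.

```python
def component_ranges(row):
    """Component-wise minimum and maximum over a non-empty list of Parikh vectors."""
    mins = tuple(min(col) for col in zip(*row))
    maxs = tuple(max(col) for col in zip(*row))
    return mins, maxs

def skip(min_a, max_a, min_b, max_b):
    """Number of rows to move down after an empty intersection (at least 1)."""
    gap = 0
    for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b):
        # Distance between the two ranges on this component; negative if they overlap.
        gap = max(gap, lo_b - hi_a, lo_a - hi_b)
    return max(1, gap)

# Row of length 3 for "aaa" and for "bbb" over the alphabet (a, b).
min_a, max_a = component_ranges([(3, 0)])
min_b, max_b = component_ranges([(0, 3)])
print(skip(min_a, max_a, min_b, max_b))  # 3
```

In this example the ranges of the letter a are $[3..3]$ and $[0..0]$, so no length above $3 - 3 = 0$ can give a match, which agrees with the two strings having no letter in common.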

Note that the tricks employed in our skip trick algorithm are motivated by
the fact that the expected value of the LCAF length of an independent and
identically distributed (i.i.d.) source is exponentially close to $n$ according to
classic Large Deviation results [Ell85]. The same result is classically extended
to an ergodic source and it is meant to be a good approximation for real life
data when the two strings follow the same probability distribution. Based on
this, we have the following conjecture.

Conjecture 6. The expected length of LCAF between two strings $A$, $B$ drawn
from an i.i.d. source is $LCAF_{avg} = n - O(\log n)$, where $|A| = |B| = n$, and the
number of computed rows in Algorithm 1 is $O(\log n)$ on average.

Finally, we will make use of one more trick, that is, computing the first vector
of the current row in constant time from the first vector of the previous row.
When we skip some rows, instead of computing the new row from scratch, we
can use the first vector of the last computed row to compute the first vector of
the new row. When we compute the rows we need, we will just populate the required
two lists and save a copy of the first vector of the computed row, as we will need
it in the next iterative steps, as shown in Algorithm 2.

For instance, if we know $M[\ell]$ and we jump to $M[\ell - 3]$, i.e., we skip $M[\ell - 1]$
and $M[\ell - 2]$, we take $M[\ell][1]$ and compute in constant time $M[\ell - 1][1]$,
$M[\ell - 2][1]$, then again compute $M[\ell - 3][1]$. From $M[\ell][1]$, to compute $M[\ell - 1][1]$,
we have to subtract 1 from the vector $M[\ell][1]$ at index $s[\ell]$, that is, the last
character of the factor of length $\ell$ starting at position 1 (i.e., $M[\ell][1]$). For
example, consider $s = aacgcctaatcg$; we have $M[12][1] = (4a, 4c, 2g, 2t)$ and
$M[11][1] = (4a, 4c, 1g, 2t)$, i.e., $(4a, 4c, 2g, 2t)$ minus $1g$.
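
The example above can be replayed with a short Python sketch. The helper names `first_vector` and `shrink_first_vector` are hypothetical, vectors are stored as dictionaries for readability, and string indices are 0-based, so the last letter of the prefix of length $\ell$ is `s[length - 1]`.

```python
def first_vector(s, length, alphabet):
    """Parikh vector (as a dict) of the prefix of `s` with the given length."""
    counts = {a: 0 for a in alphabet}
    for ch in s[:length]:
        counts[ch] += 1
    return counts

def shrink_first_vector(counts, s, old_length, new_length):
    """Update, in place, the vector of s[:old_length] into that of s[:new_length]."""
    for length in range(old_length, new_length, -1):
        counts[s[length - 1]] -= 1   # drop the last letter of the current prefix
    return counts

s = "aacgcctaatcg"
v = first_vector(s, 12, "acgt")
print(v)                                  # {'a': 4, 'c': 4, 'g': 2, 't': 2}
print(shrink_first_vector(v, s, 12, 11))  # {'a': 4, 'c': 4, 'g': 1, 't': 2}
```

Each skipped row costs one decrement, so jumping over several rows costs time proportional to the number of rows skipped rather than to the length of the string.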
Algorithm 1 Compute LCAF of x and y using the skip trick.

function ComputeLCAF(x, y)
    set ℓ = n = |x|
    set found = False
    compute maxx = MAX(x, ℓ), maxy = MAX(y, ℓ)
    compute minx = MIN(x, ℓ), miny = MIN(y, ℓ)
    if maxx == maxy then                  // at ℓ = n each row holds a single vector
        found = True
    else
        ℓ = ℓ − SKIP(minx, maxx, miny, maxy)
    end if
    while (found == False) AND (ℓ > 0) do
        compute maxx = MAX(x, ℓ), minx = MIN(x, ℓ)
        compute maxy = MAX(y, ℓ), miny = MIN(y, ℓ)
        compute listx = Mx[ℓ], listy = My[ℓ]
        sort listx, listy
        compute listx ∩ listy
        if listx ∩ listy ≠ ∅ then
            found = True
            break
        end if
        ℓ = ℓ − SKIP(minx, maxx, miny, maxy)
    end while
    return ℓ
end function

function MAX(s, ℓ)                        // MIN(s, ℓ) is symmetric, keeping minima
    int count[σ], max[σ]
    for (i = 1; i ≤ ℓ; i++) do            // Parikh vector of the first window s[1..ℓ]
        count[s[i]]++
    end for
    max = count
    for (i = 2; i ≤ |s| − ℓ + 1; i++) do  // slide the window so that it starts at i
        count[s[i − 1]]−−
        count[s[i + ℓ − 1]]++
        if count[s[i + ℓ − 1]] > max[s[i + ℓ − 1]] then
            max[s[i + ℓ − 1]] = count[s[i + ℓ − 1]]
        end if
    end for
    return max
end function

function SKIP(minx, maxx, miny, maxy)
    int gap[σ]
    for (j = 1; j ≤ σ; j++) do
        if maxx[j] < miny[j] then         // range of x lies entirely below that of y
            gap[j] = miny[j] − maxx[j]
        else if maxy[j] < minx[j] then    // range of y lies entirely below that of x
            gap[j] = minx[j] − maxy[j]
        else
            gap[j] = 0                    // the two ranges overlap on component j
        end if
    end for
    return max(1, max(gap))               // always move down by at least one row
end function

6. Experiments 

We have conducted some experiments to analyze the behaviour and running 
time of our skip trick algorithm in practice. The experiments have been run on 
a Windows Server 2008 R2 64-bit Operating System, with Intel(R) Core(TM) 
i7 2600 processor @ 3.40GHz having an installed memory (RAM) of 8.00 GB. 
The code was implemented in C# using Visual Studio 2010.



Figure 1: Plot of the average number of rows computed executing Algorithm 1
on all the strings of length 2, 3, ..., 16 over the binary alphabet (horizontal
axis: Length).

Our first experiment has been carried out principally to verify our rationale
behind using the skip trick. We experimentally evaluated the expected number
of rows computed on average by using the skip trick of Algorithm 1.



Figure 2: Plot of the average number of rows computed executing Algorithm 1
on both genomic and random datasets over the DNA alphabet.


Figure 1 shows the average number of rows computed executing Algorithm 1
on all the strings of length 2, 3, ..., 16 over the binary alphabet. The naive method
line refers to the number of rows used without the skip trick, but starting from
$\ell = n$ and decreasing $\ell$ by one at each step. Notice that the skip trick line is
always below the $\log n$ line.

To this end we have conducted an experiment to evaluate the expected number
of rows computed by our skip trick algorithm. In particular, we have
implemented the skip trick algorithm as well as the naive algorithm and have counted
the average number of rows computed by the algorithms on all the strings of
length 2, 3, ..., 16 on the binary alphabet. The results are reported in Figure 1. It
shows that the computed rows of $x$, $y$, starting from $\ell = n$ to $\ell = n - \log n$, sum
up to $O(\log n)$.

On the other hand, to reach a conclusion in this respect we would have to
increase the value of $n$ in our experiment to substantially more than 64; for
$n = 64$, $\sqrt{n}$ is just above $\log n$. Regrettably, limited computing power
prevents us from doing such an experiment. So, we resort to two more
(non-comprehensive) experimental setups, as follows, to check the practical running
time of the skip trick algorithm.



Figure 3: Plot of the average number of rows computed executing Algorithm 1 
on sequences taken from the Homo sapiens genome. 

Furthermore, we conduct our experiments on two datasets, real genomic data
and random data. We have taken a sequence ($S$) from the Homo sapiens genome
(250MB) for the former dataset. The latter dataset is generated randomly on
the DNA alphabet (i.e., $\Sigma = \{a, c, g, t\}$). In particular, here we have run the
skip trick algorithm on 2 sets of pairs of strings of lengths 10, 20, ..., 1000. For the
genomic dataset, these pairs of strings have been created as follows. For each
length $\ell \in \{10, 20, \ldots, 1000\}$, two indexes $i, j \in [1..|S| - \ell]$ have been randomly
selected to get a pair of strings $S[i..i+\ell-1]$, $S[j..j+\ell-1]$, each of length $\ell$.
A total of 1000 pairs of strings have been generated in this way for each length
$\ell$ and the skip trick algorithm has been run on these pairs to get the average
results. On the other hand, for the random dataset, we simply generate the same
number of string pairs randomly, run the skip trick algorithm on each pair
of strings and get the average results for each length group. In both cases, we
basically count the number of computed rows.

Figure 4: Plot of the average number of rows computed executing Algorithm 1
on randomly generated sequences over the alphabet $\Sigma = \{a, c, g, t\}$.


Figure 2 shows the average number of rows computed executing Algorithm 1
on both genomic and random datasets over the DNA alphabet (i.e., $\Sigma =
\{a, c, g, t\}$). Notice that the skip trick line is always below the $\log n$ line. Figure 2
shows that the computed rows of $x$, $y$, starting from $\ell = n$ to $\ell = n - \log n$,
sum up to $\Theta(\log n)$.

We experimentally evaluated the computation of the first vector and the
expected number of rows computed on average by employing the first vector trick
(Algorithm 2). We have used the same experimental configuration as above.
We measured the average number of rows and of first vectors computed when
executing Algorithm 2 on both genomic and random datasets over the DNA alphabet
(i.e., $\Sigma = \{a, c, g, t\}$). In both cases, we basically count the number of computed
rows and first vectors. The results are illustrated in Figures 3 and 4.

In both cases, the figures report the average count of computed rows (Number
of Rows), the average count of first vectors (First Vector) and the sum
of these two counts (Total). They also show the $n \log n$ curve. Both of
the figures show that the algorithm computed the first vector of the visited
rows in $\Theta(n)$ and the total running time for Algorithm 2 would be $\Theta(n \log n)$
in practice.

Since any row computation takes $O(\sigma n)$, this suggests an average time
complexity of $O(\sigma n \log n)$, i.e., $\Theta(n \log n)$ for a constant alphabet.

7. Conclusion 

In this paper we present a simple quadratic running time algorithm for the
LCAF problem and a sub-quadratic running time solution for the binary string
case, both having linear space requirement. Furthermore, we present a variant
of the quadratic solution that is experimentally shown to achieve a better time
complexity of $\Theta(n \log n)$.

Algorithm 2 Compute LCAF of x and y using the first vector trick.

function FIRST(s, ℓ)
    int first[σ]
    for (i = 1; i ≤ ℓ; i++) do
        first[s[i]]++
    end for
    return first
end function

function ROW(s, ℓ, first)                 // all vectors of row ℓ, from the first one
    int row[σ]
    row = first
    list = [copy of row]
    for (i = 2; i ≤ |s| − ℓ + 1; i++) do
        row[s[i − 1]]−−
        row[s[i + ℓ − 1]]++
        append a copy of row to list
    end for
    return list
end function

function ComputeLCAF(x, y)
    set ℓ = n = |x|
    set found = False
    compute firstx = FIRST(x, ℓ)
    compute firsty = FIRST(y, ℓ)
    while (found == False) AND (ℓ > 0) do
        compute listx = ROW(x, ℓ, firstx)     // listx = Mx[ℓ]
        compute listy = ROW(y, ℓ, firsty)     // listy = My[ℓ]
        sort listx, listy
        compute listx ∩ listy
        if listx ∩ listy ≠ ∅ then
            found = True
            break
        end if
        compute maxx = MAX(x, ℓ), minx = MIN(x, ℓ)
        compute maxy = MAX(y, ℓ), miny = MIN(y, ℓ)
        set k = SKIP(minx, maxx, miny, maxy)
        for (t = ℓ; t > ℓ − k; t−−) do        // first vector trick: shrink the prefixes
            firstx[x[t]]−−
            firsty[y[t]]−−
        end for
        ℓ = ℓ − k
    end while
    return ℓ
end function